English Text Classification by Authorship and Date
Abstract
We performed two experiments with statistical techniques for classifying documents by date and author, using large bodies of publicly available texts. In one experiment, we built a Markov chain from every United States Supreme Court opinion ever written and evaluated its ability to classify American judicial opinions by decade of authorship. In the other, we examined how well two sets of quasi-linguistic features, fed to a support vector machine, classified op-ed articles from The New York Times among four authors. The results in each case were encouraging. The Markov chain identified the decade in which a Supreme Court opinion was written to within one decade 85 percent of the time. With the two quasi-linguistic feature sets, we were able to measure the equivocation between pairs of authors and observe some interesting effects when more features were collected.
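The abstract does not say how the Markov chain was built or scored, so the following is only a minimal sketch of one common way to date a text with Markov chains: train one first-order, word-level chain per decade and pick the decade whose chain assigns the text the highest log-likelihood. The class name `DecadeMarkovClassifier`, the whitespace tokenizer, the add-one smoothing, and the toy sentences are all assumptions made for illustration and are not taken from the paper.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


class DecadeMarkovClassifier:
    """Hypothetical helper: one add-one-smoothed word bigram chain per decade."""

    def __init__(self):
        self.transitions = defaultdict(Counter)  # decade -> counts of (prev, word)
        self.contexts = defaultdict(Counter)     # decade -> counts of prev
        self.vocab = set()

    def fit(self, labelled_texts):
        """labelled_texts is an iterable of (decade, text) pairs."""
        for decade, text in labelled_texts:
            tokens = tokenize(text)
            self.vocab.update(tokens)
            for prev, word in zip(tokens, tokens[1:]):
                self.transitions[decade][(prev, word)] += 1
                self.contexts[decade][prev] += 1
        return self

    def log_likelihood(self, decade, text):
        """Sum of smoothed log transition probabilities under one decade's chain."""
        tokens = tokenize(text)
        v = len(self.vocab) + 1  # +1 reserves probability mass for unseen words
        total = 0.0
        for prev, word in zip(tokens, tokens[1:]):
            numerator = self.transitions[decade][(prev, word)] + 1
            denominator = self.contexts[decade][prev] + v
            total += math.log(numerator / denominator)
        return total

    def predict(self, text):
        """Return the decade whose chain scores the text highest."""
        return max(self.transitions, key=lambda d: self.log_likelihood(d, text))


# Invented toy sentences, not real opinions; decades are labelled by start year.
toy_corpus = [
    (1850, "the said defendant doth appeal to this court"),
    (1850, "the said plaintiff doth pray for relief"),
    (1990, "the statute violates the equal protection clause"),
    (1990, "the agency action violates the due process clause"),
]

clf = DecadeMarkovClassifier().fit(toy_corpus)
prediction = clf.predict("the said defendant doth pray for relief")
print(prediction)
```

Under this sketch, a "within one decade" score like the one quoted in the abstract would count a prediction as correct when it differs from the true decade by at most ten years.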
Similar resources
Who Wrote this Novel? Authorship Attribution across Three Languages
Based on different writing style definitions, various authorship attribution schemes have been proposed to identify the real author of a given text or text excerpt. In this article we analyze the relative performance of word types or lemmas assigned to represent styles and texts. As a second objective we compare two authorship attribution approaches, one based on principal component analysis (P...
Authorship Attribution: A Comparative Study of Three Text Corpora and Three Languages
The first objective of this paper is to carry out three experiments intended to evaluate authorship attribution methods based on three test collections available in three different languages (English, French, and German). In the first we represent and categorize 52 text excerpts written by nine authors and taken from 19th century English novels. In the second we work with 44 segments from French n...
Cross-Genre Authorship Verification Using Unmasking
Genetic Optimization of Keywords Subset in the Classification Analysis of Texts Authorship
We analyze the genetic selection of a keyword set whose text frequencies serve as attributes in text classification analysis. The genetic optimization was performed on a set of words that forms the fraction of the frequency dictionary lying within given frequency limits. The frequency dictionary was built from the analyzed collection of English fiction texts. As the fi...
Automated Authorship Attribution Using Advanced Signal Classification Techniques
In this paper, we develop two automated authorship attribution schemes, one based on Multiple Discriminant Analysis (MDA) and the other based on a Support Vector Machine (SVM). The classification features we exploit are based on word frequencies in the text. We adopt an approach of preprocessing each text by stripping it of all characters except a-z and space. This is in order to increase the p...